Stats 1 - General Linear Model

R Coding Club
RTG 2660

Dr. Lea Hildebrandt

2024-04-16

Basic Statistics

No warm-up today! It will be a bit more input today…

Today (and next week), we will try to cover a lot!

Fitting Models to Data

What is a Model?

“‘models’ are generally simplifications of things in the real world that nonetheless convey the essence of the thing being modeled”

“All models are wrong but some are useful” (G. Box)

(ST21, Ch 5)

Aim: Find the model that most efficiently and accurately summarizes the way in which the data were actually generated.

Basic structure of statistical models:

\[ data = model + error \]

Statistical Models

In general, we want to predict single observations (denoted by i) from the model. The fact that we are looking at predictions of observations and not actual values of the data is denoted by the “hat”:

\[ \widehat{data_i} = model_i \]

The error is then simply the deviation of the actual data from the predicted values:

\[ error_i = data_i - \widehat{data_i} \]

If this doesn’t make much sense yet, let’s look at an example.

A Simple Model

Let’s say we want to have a model of height of children (in the NHANES dataset also used in ST21).

What do you think would be a good model?

A Simple Model 2

The simplest model would be… the mean of the height values! This would imply that the model would predict the same height for everyone - and all individual deviations would be part of the error term.

We can write such a simple model as a formula:

\[ y_i = \beta + \epsilon_i \]

\(y_i\) denotes the individual observations (hence the \(i\)) of heights, \(\beta\) is a so-called parameter, and \(\epsilon_i\) is the error term. In this example, the parameter \(\beta\) would be the same value (= the mean height) for everyone (hence it doesn’t need an \(i\) subscript), while the error differs per individual. Parameters are values that we estimate to find the best model.

A Simple Model 3

How do we find parameters that belong to the best fitting model?

We try to minimize the error!

Remember, the error is the difference between the actual and predicted values of \(y\) (height):

\[ error_i = y_i - \hat{y_i} \]

If we select a predicted value of 400 cm, every individual’s error would be huge (because no one is 4 m tall). If we average these errors, the result would still be a large value.

A better candidate for such a simple model is thus the arithmetic mean or average:

\[ \bar{X} = \frac{\sum_{i=1}^{n}x_i}{n} \]

Summing up all individuals’ heights and dividing that number by the number of individuals gives us the mean. By definition (see the book for a proof: the individual errors cancel out), the average error is now 0!
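This can be checked with a quick sketch in R (the heights here are made up for illustration):

```r
# Made-up heights (in cm) for illustration
heights <- c(120, 135, 150, 142, 128)

# The simplest model: predict the mean for everyone
model_prediction <- mean(heights)
errors <- heights - model_prediction

mean(errors)  # 0: positive and negative errors cancel out
```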

A Note on Errors

We usually don’t simply average across the individual errors, but across the squared errors.

The reason is that positive and negative errors cancel each other out, which is not the case when squared.

The mean squared error would be in a different unit than the data (squared!), which is why we usually take the square root of that value to bring it back to the original unit: This leaves us with the root mean squared error (RMSE)!
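A quick sketch of these two quantities in R, with made-up heights:

```r
heights <- c(120, 135, 150, 142, 128)  # made-up data in cm
errors  <- heights - mean(heights)

mse  <- mean(errors^2)  # mean squared error, in cm^2
rmse <- sqrt(mse)       # root mean squared error, back in cm
rmse
```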

A Slightly More Complex Model

Obviously, the model for predicting height from the average is not very good: It predicts the same height for all children! (The RMSE is 27 cm!)

How can we improve this model?

We can account for other information that we might have!
For example, accounting for age might be a good idea: Older children are likely taller than younger ones. We plot height against age to visually inspect the relationship:

A Slightly More Complex Model 2

As we can see, the line (~ the model) fits the data points much better if we include a constant (or intercept) and age. We would write this as the formula:

\[ \hat{y_i} = \hat{\beta_0} + \hat{\beta_1} * age_i \]

Remember from linear algebra that this defines a line:

\[ y = slope * x + intercept \]

Thus \(\beta_0\) is the parameter for the intercept and \(\beta_1\) for the slope of age!

The model fit is now much better: RMSE = 8.36 cm.

Adding gender? Does not improve model too much!
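A minimal sketch of such a model in R, using simulated data (the slides fit this to the NHANES children; the “true” slope of 6 cm per year here is made up):

```r
set.seed(1)
age    <- runif(100, min = 2, max = 17)
height <- 80 + 6 * age + rnorm(100, sd = 8)  # made-up "true" relationship

fit <- lm(height ~ age)
coef(fit)                     # beta_0 (intercept) and beta_1 (slope for age)
sqrt(mean(residuals(fit)^2))  # RMSE: much smaller than for the mean-only model
```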

What is a “Good” Model?

Two aims:

  1. Describe data well (= low error/RMSE)

  2. Generalize to new data (low error when applied to new data)

Can be conflicting!

Where does error come from? (In addition to individual differences!)

  • measurement error (noise): random variation in data

    • actual measurement is biased (broken device, bias etc.)

    • “thing measured” may be biased/varies a lot

  • wrong model specification

    • e.g. height goes down with age

    • important variable is missing from model (age!)

Examples Measurement Error

Simulated relationship between blood alcohol content and reaction time on a driving test, with best-fitting linear model represented by the line. A: linear relationship with low measurement error. B: linear relationship with higher measurement error. C: Nonlinear relationship with low measurement error and (incorrect) linear model

Can a Model be too Good?

Yes! This is called overfitting.

If we fit a line too closely to the data, the model might not be able to generalize to other data well.

An example of overfitting. Both datasets were generated using the same model, with different random noise added to generate each set. The left panel shows the data used to fit the model, with a simple linear fit in blue and a complex (8th order polynomial) fit in red. The root mean square error (RMSE) values for each model are shown in the figure; in this case, the complex model has a lower RMSE than the simple model. The right panel shows the second dataset, with the same model overlaid on it and the RMSE values computed using the model obtained from the first dataset. Here we see that the simpler model actually fits the new dataset better than the more complex model, which was overfitted to the first dataset.

Summarizing Data

Central Tendency

Why summarize data?

It’s a model & describes the data! E.g. the mean = central tendency of the data

Mean, Median, Mode?

Mean = minimizes sum of squared error, but highly influenced by outliers!
Median = “middle” value if ranked, minimizes sum of absolute error, less influenced by extreme values
Mode = most often occurring value

Example:

If 3 people earn 10,000 Euros per year and 1 person earns 1,000,000:
Mean: 257,500 Euros
Median: (Rank: 10,000; 10,000; 10,000; 1,000,000 -> middle value = )10,000 Euros
Mode: 10,000 Euros
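These three measures are easy to check in R (the mode needs a one-liner, since R’s built-in mode() refers to the storage type, not the statistical mode):

```r
income <- c(10000, 10000, 10000, 1000000)

mean(income)    # 257500, pulled up by the outlier
median(income)  # 10000, robust to the extreme value

# statistical mode via the frequency table
as.numeric(names(which.max(table(income))))  # 10000
```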

Variability

How widespread are the data?

Variance and Standard Deviation

Variance ≊ Mean Squared Error

\[ \sigma^2 = \frac{SSE}{N} = \frac{\sum_{i=1}^n (x_i - \mu)^2}{N} \]

(Note: \(x_i\) = value of an individual observation, \(\mu\) = population mean, as opposed to \(\bar{X}\) = sample mean)

Standard Deviation ≊ Root Mean Squared Error

\[ SD = \sigma = \sqrt{\sigma^2} \]

We usually don’t know the population mean \(\mu\), that’s why we estimate the sample variance (with the “hat” indicating an estimate):

\[ \hat\sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar{X})^2}{n-1} \]

Note: we now use the sample mean \(\bar{X}\) and \(n\) for the sample size. Dividing by \(n-1\) (instead of \(n\)) makes the estimate less biased.
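R’s var() already uses the \(n-1\) denominator, which we can verify by hand on toy data:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # toy data
n <- length(x)

# sample variance by hand, with the n-1 denominator
var_hat <- sum((x - mean(x))^2) / (n - 1)

var_hat       # identical to var(x)
var(x)
sqrt(var(x))  # the sample standard deviation, sd(x)
```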

Z-Scores

\[ Z(x) = \frac{x - \mu}{\sigma} \]

  • standardizes the distribution: How far is any data point from the mean in units of SD?
  • doesn’t change original relationship of data points!
    • shifts distribution to have a mean = 0 and SD = 1.
  • useful if we compare (or use in a model) variables on different scales/units!

Density (top) and cumulative distribution (bottom) of a standard normal distribution, with cutoffs at one standard deviation above/below the mean.
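Z-scoring is one line in R, either by hand or with scale() (which returns a matrix, hence the as.numeric()):

```r
x <- c(2, 4, 6, 8, 10)  # toy data

z  <- (x - mean(x)) / sd(x)  # by hand
z2 <- as.numeric(scale(x))   # same result

mean(z)  # 0: the distribution is shifted...
sd(z)    # 1: ...and rescaled, but relative positions are unchanged
```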

Probability, Sampling, Null-Hypothesis Testing…

Let’s skip this (for now?) - but you should know that there are different probability distributions (normal, uniform etc.) and that we draw samples from the population when we run experiments…
(Also, some classical probability theory and resampling methods like Monte Carlo simulations are helpful to know).

We also skip Null Hypothesis Significance Testing, which is the main approach we’re using in psychology etc., as well as Confidence Intervals, Effect Sizes… You should have a rough idea of these concepts.

The General Linear Model

Remember the basic model of statistics:

\[ data = model + error \]

Our general goal is to find the model with the best fit, i.e. that minimizes the error.

One approach is the GLM. You might be surprised that a lot of the common models can be viewed as linear models:

Many common statistical tests can be thought of as linear models

Definitions

Dependent variable (DV): The outcome variable that the model aims to explain (\(Y\)).

Independent variable (IV): The variable that we use to explain the DV (\(X\)).

Linear model: The model for the DV is composed of a linear combination of IVs (that are multiplied by different weights!)

The weights are the parameters \(\beta\) and determine the relative contribution of each IV. (This is what the model estimates! The weights thus give us the important information we’re usually interested in: How strong are IV and DV related.)

There may also be several DVs, but usually that’s not the case, and we will focus on models with a single DV!

Example

Let’s use some simulated data:

We can calculate the correlation between the two variables:


    Pearson's product-moment correlation

data:  df$grade and df$studyTime
t = 2, df = 6, p-value = 0.09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.13  0.93
sample estimates:
 cor 
0.63 

The correlation is quite high (.63), but the CI is also pretty wide.

Fundamental activities of statistics:

  • Describe: How strong is the relationship between grade and study time?

  • Decide: Is there a statistically significant relationship between grade and study time?

  • Predict: Given a particular amount of study time, what grade do we expect?

Linear Regression

Use the GLM (~ synonymous with linear regression) to…

  • describe the relation between two variables (similar to correlation)

  • predict DV for new values of IV (new observations)

  • add multiple IVs!

Simple GLM:

\[ y = \beta_0+ x * \beta_x + \epsilon \]

\(\beta_0\) = intercept, the overall offset of the line when \(x=0\) (even if that is impossible)
\(\beta_x\) = slope, how much do we expect \(y\) to change with each change in \(x\)?
\(y\) = DV
\(x\) = IV or predictor
\(\epsilon\) = error term, whatever variance is left over once the model is fit, residuals! (Think of the model as the line that is fitted and the residuals are the vertical deviations of the data points from the line!)

(If we refer to predicted \(y\)-values, after we have estimated the model fit/line, we can drop the error term: \(\hat{y} = \hat{\beta_0} + x * \hat{\beta_x}\).)

The Relation Between Correlation and Regression

There is a close relation and we can convert \(r\) to \(\hat{\beta_x}\).

\(\hat{r} = \frac{covariance_{xy}}{s_x * s_y}\)

\(\hat{\beta_x} = \frac{covariance_{xy}}{s_x*s_x}\)

\(covariance_{xy} = \hat{r} * s_x * s_y\)

\(\hat{\beta_x} = \frac{\hat{r} * s_x * s_y}{s_x * s_x} = \hat{r} * \frac{s_y}{s_x}\)

→ Regression slope = correlation multiplied by the ratio of SDs (if the SDs are equal, \(\hat{r} = \hat{\beta_x}\))
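This identity can be checked numerically with simulated data:

```r
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50)  # made-up linear relationship

r     <- cor(x, y)
slope <- coef(lm(y ~ x))[["x"]]

# regression slope = correlation * ratio of SDs
r * sd(y) / sd(x)  # same value as `slope`
slope
```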

Standard Errors for Regression Models

We usually want to make inferences about the regression parameter estimates. For this we need an estimate of their variability.

We first need an estimate of how much variability is not explained by the model: the residual variance (or error variance):

Compute residuals:

\[ residual = y - \hat{y} = y - (x*\hat{\beta_x} + \hat{\beta_0}) \]

Compute Sum of Squared Errors (remember?):

\[ SS_{error} = \sum_{i=1}^n{(y_i - \hat{y_i})^2} = \sum_{i=1}^n{residuals^2} \]

Compute Mean Squared Error:

\[ MS_{error} = \frac{SS_{error}}{df} = \frac{\sum_{i=1}^n{(y_i - \hat{y_i})^2} }{N - p} \]

where the \(df\) are the number of observations \(N\) minus the number of estimated parameters \(p\) (in this case 2: \(\hat{\beta_0}\) and \(\hat{\beta_x}\)).

Finally, we can calculate the standard error for the full model:

\[ SE_{model} = \sqrt{MS_{error}} \]

We can also calculate the SE for specific regression parameter estimates by rescaling the \(SE_{model}\):

\[ SE_{\hat{\beta_x}} = \frac{SE_{model}}{\sqrt{\sum{(x_i - \bar{x})^2}}} \]

Statistical Tests for Regression Parameters

With the parameter estimates and their standard errors, we can compute \(t\)-statistics, which represent the likelihood of the observed estimate vs. the expected value under \(H_0\) (usually 0, no effect).

\[ \begin{array}{c} t_{N - p} = \frac{\hat{\beta} - \beta_{expected}}{SE_{\hat{\beta}}}\\ t_{N - p} = \frac{\hat{\beta} - 0}{SE_{\hat{\beta}}}\\ t_{N - p} = \frac{\hat{\beta} }{SE_{\hat{\beta}}} \end{array} \]
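These steps can be reproduced by hand and checked against lm() (simulated data here, since the slides’ df object is not reproduced):

```r
set.seed(7)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)  # made-up data
fit <- lm(y ~ x)

beta_hat <- coef(fit)[["x"]]
ms_error <- sum(residuals(fit)^2) / (length(y) - 2)      # df = N - p, p = 2
se_beta  <- sqrt(ms_error) / sqrt(sum((x - mean(x))^2))  # SE of the slope
t_stat   <- beta_hat / se_beta                           # H0: beta = 0

t_stat  # matches the t value in summary(fit)
```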

Usually, we would just let R do the calculations:


Call:
lm(formula = grade ~ studyTime, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.656  -2.719   0.125   4.703   7.469 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    76.16       5.16   14.76  6.1e-06 ***
studyTime       4.31       2.14    2.01    0.091 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.4 on 6 degrees of freedom
Multiple R-squared:  0.403, Adjusted R-squared:  0.304 
F-statistic: 4.05 on 1 and 6 DF,  p-value: 0.0907

The intercept is significantly different from zero (which is usually not very relevant), but the effect of studyTime is not significant: For every additional hour of study, the predicted grade increases by about 4.3 points, but we cannot rule out that there is no effect at all.

Quantifying Goodness of Fit of the Model

Often, it is useful to check how good the model we estimated fits the data.

We can do that easily by asking how much of the variability in the data is accounted for by the model?

If we only have one IV (\(x\)), then we can simply square the correlation coefficient:

\[ R^2 = r^2 \]

In study time example, \(R^2\) = 0.4 –> we accounted for 40% of the overall variance in grades!

More generally, we can calculate \(R^2\) with the Sum of Squared Variances:

\[ R^2 = \frac{SS_{model}}{SS_{total}} = 1-\frac{SS_{error}}{SS_{total}} \]
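Both routes to \(R^2\) give the same number, which we can verify on simulated data:

```r
set.seed(3)
x <- rnorm(40)
y <- x + rnorm(40)  # made-up data
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)
ss_error <- sum(residuals(fit)^2)
r2 <- 1 - ss_error / ss_total

c(r2, cor(x, y)^2, summary(fit)$r.squared)  # all three agree
```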

Fitting More Complex Models

Often we want to know the effects of multiple variables (IVs) on some outcome.

Example:
Some students have taken a very similar class before, so there might not only be the effect of studyTime on grades, but also of having taken a priorClass.

We can build a model that takes both into account by simply adding another “weight” and IV (priorClass) to the model:

\(\hat{y} = \hat{\beta_1}*studyTime + \hat{\beta_2}*priorClass + \hat{\beta_0}\)

  • To model priorClass, i.e. whether each individual has taken a previous class or not, we use dummy coding (0=no, 1=yes).
  • This means, for those who have not taken a class, the whole part of the equation (\(\hat{\beta_2} * priorClass\)) will be zero - we will add it for the others.
  • \(\hat{\beta_2}\) is thus the difference in means between the two groups!
  • \(\hat{\beta_1}\) is the regression slope of studyTime across data points/regardless of whether someone has taken a class before.
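A minimal sketch of the dummy-coding logic (the grades here are made up, not the slides’ data):

```r
priorClass <- c(0, 0, 0, 0, 1, 1, 1, 1)  # dummy coded: 0 = no, 1 = yes
grade      <- c(70, 72, 68, 74, 80, 85, 78, 83)

fit <- lm(grade ~ priorClass)
coef(fit)[["priorClass"]]  # beta_2 ...

# ... equals the difference in group means
mean(grade[priorClass == 1]) - mean(grade[priorClass == 0])
```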

If we plot the data, we can see that both IVs seem to have an effect on grades:

How can we tell from the plot that both IVs might have an effect?

Interactions Between Variables

We previously assumed that the effect of studyTime on grade was the same for both groups - but sometimes we expect that this regression slope differs per group!

This is what we call an interaction: The effect of one variable depends on the value of another variable.

Example: What is the effect of caffeine on public speaking?
There doesn’t seem to be an effect:

# perform linear regression with caffeine as independent variable
lmResultCaffeine <- lm(speaking ~ caffeine, data = df)
summary(lmResultCaffeine)

Call:
lm(formula = speaking ~ caffeine, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-33.10 -16.02   5.01  16.45  26.98 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -7.413      9.165   -0.81     0.43
caffeine       0.168      0.151    1.11     0.28

Residual standard error: 19 on 18 degrees of freedom
Multiple R-squared:  0.0642,    Adjusted R-squared:  0.0122 
F-statistic: 1.23 on 1 and 18 DF,  p-value: 0.281

Interactions 2

What if we find research suggesting that anxious people react differently to caffeine than non-anxious people?

Let’s include anxiety in the model:

# compute linear regression adding anxiety to model
lmResultCafAnx <- lm(speaking ~ caffeine + anxiety, data = df)
summary(lmResultCafAnx)

Call:
lm(formula = speaking ~ caffeine + anxiety, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-32.97  -9.74   1.35  10.53  25.36 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -12.581      9.197   -1.37     0.19
caffeine             0.131      0.145    0.91     0.38
anxietynotAnxious   14.233      8.232    1.73     0.10

Residual standard error: 18 on 17 degrees of freedom
Multiple R-squared:  0.204, Adjusted R-squared:  0.11 
F-statistic: 2.18 on 2 and 17 DF,  p-value: 0.144

It looks like the effect of caffeine is indeed different for the two anxiety groups: Increasing for non-anxious people and decreasing for anxious ones.

However, the model is not significant!

This is due to the fact that we only look at additive effects (main effects) with this model. Overall, neither caffeine nor anxiety alone predicts speaking performance.

In other words: The model tries to fit the same slope for both groups, which is a flat line.

Interactions 3

To allow for different slopes for each group (i.e. for the effect of caffeine to vary between the anxiety groups), we have to model the interaction as well.

The interaction is simply the product of the two variables:

# compute linear regression including caffeine X anxiety interaction
lmResultInteraction <- lm(
  speaking ~ caffeine + anxiety + caffeine:anxiety,
  # speaking ~ caffeine * anxiety,  # same!
  data = df
)
summary(lmResultInteraction)

Call:
lm(formula = speaking ~ caffeine + anxiety + caffeine:anxiety, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.385  -7.103  -0.444   6.171  13.458 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 17.4308     5.4301    3.21  0.00546 ** 
caffeine                    -0.4742     0.0966   -4.91  0.00016 ***
anxietynotAnxious          -43.4487     7.7914   -5.58  4.2e-05 ***
caffeine:anxietynotAnxious   1.0839     0.1293    8.38  3.0e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.1 on 16 degrees of freedom
Multiple R-squared:  0.852, Adjusted R-squared:  0.825 
F-statistic: 30.8 on 3 and 16 DF,  p-value: 7.01e-07

We now see that there are significant main effects for both caffeine and anxiety, as well as the significant interaction between both variables. (We have to be careful of interpreting the main effects when an interaction is also significant!)

The interpretation of the coefficients when interactions are included is not as straightforward!

If you want to report the “typical” ANOVA table with main effects and the general interaction:

anova(lmResultInteraction)
Analysis of Variance Table

Response: speaking
                 Df Sum Sq Mean Sq F value Pr(>F)    
caffeine          1    455     455    6.96 0.0179 *  
anxiety           1    992     992   15.17 0.0013 ** 
caffeine:anxiety  1   4593    4593   70.27  3e-07 ***
Residuals        16   1046      65                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Comparison

Sometimes, we want to compare two (nested!) models to see which one fits the data better.

We can do so by using the anova() function in R:

anova(lmResultCafAnx, lmResultInteraction) 
Analysis of Variance Table

Model 1: speaking ~ caffeine + anxiety
Model 2: speaking ~ caffeine + anxiety + caffeine:anxiety
  Res.Df  RSS Df Sum of Sq    F Pr(>F)    
1     17 5639                             
2     16 1046  1      4593 70.3  3e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows that Model 2, incl. the interaction, is to be preferred.

Note: We can only use this method with nested models, which means that the simpler (reduced) model only contains variables also included in the more complex (full) model.

Criticizing Our Model and Checking Assumptions

“Garbage in, garbage out” - we have to make sure our model is properly specified!

Properly specified = having included the appropriate IVs.

The model also needs to satisfy the assumptions of the statistical method (= GLM).

One important assumption of the GLM is that the residuals are normally distributed (NOT necessarily the data!).

This assumption can be violated by a not properly specified model or because the data are inappropriate for the statistical model.

We can use a Q-Q plot, which represents the quantiles of two distributions/variables (e.g. the data and a normal distribution of the same data) against each other.

If the data points diverge substantially from the line (especially in the extremes), we can conclude that the residuals are not normally distributed.
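In base R, a Q-Q plot of a model’s residuals takes two lines (sketch with simulated data):

```r
set.seed(5)
x <- rnorm(50)
y <- 2 * x + rnorm(50)  # made-up data
fit <- lm(y ~ x)

qqnorm(residuals(fit))  # residual quantiles vs. normal quantiles
qqline(residuals(fit))  # reference line: points close to it -> roughly normal residuals
```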

Model Diagnostics 2

To check the assumptions, we can easily run a function for model diagnostics (incl. Q-Q plots) in R. The function, check_model(), is included in the performance package by the easystats team (who make great packages for everything related to statistical modeling!)

# install.packages("easystats")
library(performance)  

check_model(lmResultInteraction)

We won’t go into detail about all of these diagnostics (they are also hard to see here), but it is always a good idea to run diagnostics and check the assumptions of your models!

What Does “Predict” Really Mean?

We neither mean “predicting before seeing the data/in the future” nor mean to imply causality!

It simply refers to fitting a model to the data: We estimate (or predict) values for the DV (\(\hat{y}\)) and the IVs are often referred to as predictors.

This is related to, but not the same as, forecasting future values.

ANOVA

As you saw earlier, the “normal” ANOVA is a special case of the linear model.

The difference is that the linear model is a bit more flexible in terms of including different IVs/predictors (continuous & categorical), whereas the ANOVA can only use categorical IVs (unless you run an ANCOVA).

An ANOVA is basically an F-Test that we have used previously (e.g. anova(modelfit)).

The F statistic is calculated by differentiating Sum of Squares Within (SSW) and Sum of Squares Between (SSB) categories/factor levels:

\[ SSW = \sum_{j=1}^{M}\sum_{i=1}^{n_j}{(y_{ij} - \bar{y}_j)^2} \]

SSW is the (squared and summed) difference between each data point and its category’s mean.

\[ SSB = \sum_{j=1}^{M}{n_j(\bar{y}_j - \bar{y})^2} \]

SSB is the (squared and summed) difference between each category’s mean and the overall mean, weighted by the group sizes \(n_j\).

\[ F = \frac{SSB/(M-1)}{SSW/(N-M)} \]

M = number of categories/groups/factor levels; N = total number of observations; \(n_j\) = number of observations in category \(j\)
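We can compute SSW, SSB, and F by hand and compare them with R’s ANOVA output (simulated data with three groups):

```r
set.seed(9)
g <- factor(rep(c("a", "b", "c"), each = 10))  # M = 3 groups, N = 30
y <- rnorm(30, mean = c(0, 1, 2)[g])           # made-up group means

grand <- mean(y)
means <- tapply(y, g, mean)
n_j   <- tapply(y, g, length)

ssb <- sum(n_j * (means - grand)^2)  # between-group sum of squares
ssw <- sum((y - means[g])^2)         # within-group sum of squares

f_manual <- (ssb / (3 - 1)) / (ssw / (30 - 3))  # df: M - 1 and N - M
f_manual
anova(lm(y ~ g))[["F value"]][1]  # same value
```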

Repeated Measures ANOVA & Linear Mixed Model

Next time?

Problem: Dependent measures!

Thanks!